JFruit2 Optimisation¶
This tutorial will cover the use of black-box optimisation on the JFruit2 simulation model.
Black-box optimisation applied to the JFruit2 fruit model means searching for parameter values, such as growth rates, transpiration coefficients, or sugar transport constants, that make the model outputs best match observed fruit data, without requiring access to, or derivatives of, the internal equations.
The optimiser only "sees" the model's inputs and outputs, evaluating how well each trial run fits experimental or target outcomes. This approach is especially useful when the biological processes are too complex, nonlinear, or computationally expensive to model analytically, allowing calibration of JFruit2 purely through iterative evaluation and intelligent search.
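The loop below sketches this idea end-to-end with a hypothetical one-parameter toy model (not JFruit2) and naive random search: the optimiser proposes parameters, runs the model, scores the output against observations, and keeps the best result.

```python
import numpy as np

rng = np.random.default_rng(42)
ages = np.arange(1.0, 160.0)

def toy_model(growth_rate: float, ages: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for a fruit model: exponential growth only.
    return 0.03 * np.exp(growth_rate * ages)

# Pretend observations generated with a "true" growth rate of 0.05.
observed = toy_model(0.05, ages)

best_params, best_mse = None, np.inf
for _ in range(200):
    candidate = rng.uniform(0.01, 0.1)  # sample the search space blindly
    mse = float(np.mean((toy_model(candidate, ages) - observed) ** 2))
    if mse < best_mse:
        best_params, best_mse = candidate, mse
```

Bayesian methods such as the TPES sampler used later in this tutorial replace the blind sampling with an informed search, but the black-box contract is identical: propose, evaluate, score.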
Imports¶
We will first import all required dependencies.
import pandas as pd
import numpy as np
from jfruit2 import JFruit2
import os.path as osp
import os
import shutil
from calisim.data_model import (
    DistributionModel,
    ParameterDataType,
    ParameterSpecification,
)
from calisim.optimisation import OptimisationMethod, OptimisationMethodModel
from calisim.statistics import MeanSquaredError, MeanPinballLoss
Observed data¶
We will next load the observed field data.
observed_data = JFruit2.get_observed_data()
observed_data
| | age_d | age_h | pip | piv | pp | pv | sfstone | af | ctcs | pif | ... | w | s | pf1 | pf2 | pf | v | dD | mSol | mSta | mSyn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 13.778 | 13.778 | 9.8344 | 9.8344 | 0.0 | 0.57664 | 0.000267 | 11.961 | ... | 0.033638 | 0.010665 | 0.98543 | 4.7757 | 0.98543 | 0.040303 | -0.37917 | 0.000646 | 0.000103 | 0.004049 |
| 1 | 2 | 2 | 13.774 | 13.774 | 9.8308 | 9.8308 | 0.0 | 0.58355 | 0.000264 | 11.876 | ... | 0.034391 | 0.010694 | 0.98872 | 5.1422 | 0.98872 | 0.041074 | -0.74167 | 0.000654 | 0.000103 | 0.004054 |
| 2 | 2 | 3 | 13.774 | 13.774 | 9.8304 | 9.8304 | 0.0 | 0.59076 | 0.000261 | 11.804 | ... | 0.035182 | 0.010724 | 0.98789 | 5.0852 | 0.98789 | 0.041885 | -1.08540 | 0.000661 | 0.000103 | 0.004059 |
| 3 | 2 | 4 | 13.771 | 13.771 | 9.8273 | 9.8273 | 0.0 | 0.59815 | 0.000258 | 11.716 | ... | 0.036000 | 0.010754 | 0.98967 | 5.3017 | 0.98967 | 0.042721 | -1.41250 | 0.000668 | 0.000103 | 0.004064 |
| 4 | 2 | 5 | 13.767 | 13.767 | 9.8238 | 9.8238 | 0.0 | 0.60613 | 0.000254 | 11.619 | ... | 0.036889 | 0.010784 | 0.99307 | 5.6945 | 0.99307 | 0.043629 | -1.70830 | 0.000676 | 0.000103 | 0.004070 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3787 | 159 | 20 | 13.827 | 13.827 | 9.8834 | 9.8834 | 0.0 | 194.40000 | 0.000611 | 20.593 | ... | 190.860000 | 39.733000 | 0.94875 | 10.3730 | 0.94875 | 215.700000 | -3046.20000 | 8.391700 | 1.092500 | 7.243600 |
| 3788 | 159 | 21 | 13.821 | 13.821 | 9.8781 | 9.8781 | 0.0 | 194.45000 | 0.000610 | 20.527 | ... | 190.940000 | 39.733000 | 0.94891 | 12.5310 | 0.94891 | 215.780000 | -3047.00000 | 8.392100 | 1.092100 | 7.243600 |
| 3789 | 159 | 22 | 13.817 | 13.817 | 9.8737 | 9.8737 | 0.0 | 194.50000 | 0.000610 | 20.470 | ... | 191.030000 | 39.733000 | 0.94898 | 13.5650 | 0.94898 | 215.870000 | -3047.80000 | 8.392600 | 1.091700 | 7.243600 |
| 3790 | 159 | 23 | 13.814 | 13.814 | 9.8711 | 9.8711 | 0.0 | 194.56000 | 0.000610 | 20.433 | ... | 191.130000 | 39.733000 | 0.94902 | 14.1340 | 0.94902 | 215.970000 | -3048.50000 | 8.393100 | 1.091300 | 7.243600 |
| 3791 | 159 | 24 | 13.813 | 13.813 | 9.8693 | 9.8693 | 0.0 | 194.61000 | 0.000610 | 20.406 | ... | 191.230000 | 39.734000 | 0.94901 | 13.9960 | 0.94901 | 216.070000 | -3049.20000 | 8.393800 | 1.090900 | 7.243600 |
3792 rows × 33 columns
Calibration procedure¶
Single-objective optimisation¶
We will next run the calibration procedures against observed fruit water mass (w) data. In this case, we will perform black-box optimisation.
We begin by specifying our parameter distributions. We will calibrate 3 parameters:
- Growth.lp: Conductivity of the composite membrane for water transport (phloem)
- Growth.lxAsy: Negative asymptote of the inverse logistic curve to calculate hydraulic conductivity
- Growth.kx: Maximum slope of the inverse logistic curve to calculate hydraulic conductivity
The ranges below were identified through consultation with scientists at both Plant and Food Research (now the Bioeconomy Science Institute) and INRAE.
parameter_spec = ParameterSpecification(
    parameters=[
        DistributionModel(
            name="Growth.lp",
            distribution_name="uniform",
            distribution_args=[0.001, 0.004],
            data_type=ParameterDataType.CONTINUOUS,
        ),
        DistributionModel(
            name="Growth.lxAsy",
            distribution_name="uniform",
            distribution_args=[0.01, 0.015],
            data_type=ParameterDataType.CONTINUOUS,
        ),
        DistributionModel(
            name="Growth.kx",
            distribution_name="uniform",
            distribution_args=[0.0019, 0.0038],
            data_type=ParameterDataType.CONTINUOUS,
        ),
    ]
)
We'll define an objective function for our optimiser, which will return the discrepancy between observed and simulated fruit water mass values, as measured by the mean squared error (MSE). We aim to maximise the predictive accuracy of JFruit2 by minimising the MSE returned by the objective function; this is a single-objective optimisation problem.
def objective(
    parameters: dict, simulation_id: str, observed_data: np.ndarray | None
) -> float | list[float]:
    model = JFruit2()
    # Apply the trial parameters to the model properties.
    props = model.load_properties()
    for k in parameters:
        props[k] = parameters[k]
    model.save_properties(props)
    model.run(
        properties=osp.join("data", "out", f"{model.sim_id}.properties")
    )
    # Score the discrepancy between simulated and observed water mass.
    simulated_w = model.results.w.values
    observed_w = observed_data.w.values
    metric = MeanSquaredError()
    discrepancy = metric.calculate(observed_w, simulated_w)
    return discrepancy
We'll next run our black-box optimisation procedure using the Tree-structured Parzen Estimator (TPES) from the Optuna library.
TPES is a Bayesian optimisation method that uses two density-estimation surrogate models to learn which JFruit2 parameter values lead to good and bad results, where:
- $l(x) = p(x \mid y < y^*) $: models parameter values that produced good results
- $g(x) = p(x \mid y \ge y^*) $: models parameter values that produced bad results
Here $x$ denotes a parameter set and $y^*$ is a performance threshold (often a quantile of the observed losses). New candidate parameter sets are chosen to maximise the ratio $l(x)/g(x)$; in other words, we want to evaluate solutions that look like past winners but are still uncertain. TPES is efficient for high-dimensional and mixed-type search spaces.
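The acquisition idea can be sketched in a few lines of NumPy. This is only an illustration of the ratio rule, not Optuna's actual implementation: the trial history, loss surface, and kernel bandwidth below are all made up.

```python
import numpy as np

def kde(samples: np.ndarray, x: np.ndarray, bw: float = 2e-4) -> np.ndarray:
    # Simple Gaussian kernel density estimate over a 1D parameter.
    z = (x[:, None] - samples[None, :]) / bw
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(samples) * bw * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
params = rng.uniform(0.001, 0.004, size=50)  # past trial values
losses = (params - 0.002) ** 2               # pretend loss: best near 0.002
y_star = np.quantile(losses, 0.25)           # performance threshold

good = params[losses < y_star]               # samples behind l(x)
bad = params[losses >= y_star]               # samples behind g(x)

# Rank fresh candidates by l(x) / g(x) and propose the best one.
candidates = rng.uniform(0.001, 0.004, size=100)
ratio = kde(good, candidates) / kde(bad, candidates)
next_trial = candidates[np.argmax(ratio)]
```

The proposed trial is drawn toward the region where past good results cluster, which is exactly how TPES concentrates its search budget.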
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
specification = OptimisationMethodModel(
    experiment_name="optuna_optimisation",
    parameter_spec=parameter_spec,
    observed_data=observed_data,
    method="tpes",
    output_labels=["Water Mass Discrepancy"],
    directions=["minimize"],
    n_jobs=1,
    n_iterations=35,
    method_kwargs=dict(n_startup_trials=10),
)
calibrator = OptimisationMethod(
    calibration_func=objective, specification=specification, engine="optuna"
)
calibrator.specify().execute().analyze()
<calisim.optimisation.implementation.OptimisationMethod at 0x7faf34083c70>
The empirical distribution function plot depicts the proportion of trials at or below a given objective value; roughly 32% of trials have an MSE below 2000.
We see diminishing improvements in the MSE beyond trial 30, which is when the optimiser begins to converge.
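For reference, the empirical distribution function reading is just the proportion of trial losses below a threshold. A sketch with hypothetical trial MSEs (the actual values live in the calibrator's study history):

```python
import numpy as np

# Hypothetical trial MSEs; the real values come from the optimisation study.
trial_mses = np.array([1500.0, 1800.0, 1900.0, 2200.0, 2500.0, 2600.0, 3200.0, 4100.0])

# ECDF at a threshold: the proportion of trials with loss below it.
fraction_below = float(np.mean(trial_mses < 2000.0))
```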
The parallel coordinate plot shows the relationship between the objective value and our 3 parameters, with each trial drawn as a line crossing one vertical axis per parameter; darker lines indicate better-performing trials. We see less spread in the trajectories at Growth.lp and Growth.kx, and greater spread for Growth.lxAsy.
According to the fANOVA sensitivity analysis, Growth.lp is the most important parameter. This is consistent with the results of the Sobol sensitivity analysis.
Our slice plots show the sampling history of TPES over trials. It quickly converges to a solution.
We'll extract the point estimates from our calibrator.
optimisation_df = pd.DataFrame({
"parameter_name": [
model.name
for model in calibrator.get_parameter_estimates().estimates
],
"parameter_estimate": [
model.estimate
for model in calibrator.get_parameter_estimates().estimates
]
})
optimisation_df
| | parameter_name | parameter_estimate |
|---|---|---|
| 0 | Growth.kx | 0.003677 |
| 1 | Growth.lp | 0.001050 |
| 2 | Growth.lxAsy | 0.010132 |
We see the parameter estimates above for our 3 parameters. We can run JFruit2 again using these optimised estimates, and compare the simulated and observed fruit water mass values.
parameters = {
    row["parameter_name"]: row["parameter_estimate"]
    for row in optimisation_df.to_dict("records")
}
parameters
model = JFruit2()
props = model.load_properties()
for k in parameters:
    props[k] = parameters[k]
model.save_properties(props)
model.run(
    properties=osp.join("data", "out", f"{model.sim_id}.properties")
)
pd.DataFrame({
    "observed": observed_data.w.values,
    "simulated": model.results.w.values
}).plot.scatter("observed", "simulated")
<Axes: xlabel='observed', ylabel='simulated'>
After calibration via black-box optimisation, the predictive accuracy of our JFruit2 model is high: the simulated water mass values track the observed values closely, and the model's behaviour more closely reflects reality.
Multi-objective optimisation¶
Let's repeat this process, but this time we will incorporate a second objective. Hence, we are performing multi-objective rather than single-objective optimisation.
Let's aim to minimise the discrepancy between simulated and observed carbon content in soluble sugars (mSol), and between simulated and observed fruit water mass (w). We will use the mean pinball loss metric for mSol, and the MSE for w.
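For reference, the pinball (quantile) loss penalises under- and over-prediction asymmetrically according to a quantile level alpha, and at alpha = 0.5 it reduces to half the mean absolute error. The sketch below is illustrative only; the quantile level that calisim's MeanPinballLoss uses by default is an assumption here.

```python
import numpy as np

def mean_pinball_loss(observed: np.ndarray, simulated: np.ndarray, alpha: float = 0.5) -> float:
    # Pinball loss: under-predictions are weighted by alpha,
    # over-predictions by (1 - alpha).
    diff = observed - simulated
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff)))

obs = np.array([1.0, 2.0, 3.0])
sim = np.array([1.5, 2.0, 2.0])
loss = mean_pinball_loss(obs, sim)  # 0.25, i.e. half the MAE at alpha = 0.5
```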
We can re-use our parameter specification from single-objective optimisation, but we'll need to define a new objective function.
def objective(
    parameters: dict, simulation_id: str, observed_data: np.ndarray | None
) -> float | list[float]:
    model = JFruit2()
    props = model.load_properties()
    for k in parameters:
        props[k] = parameters[k]
    model.save_properties(props)
    model.run(
        properties=osp.join("data", "out", f"{model.sim_id}.properties")
    )
    # Two discrepancies: MSE for water mass, pinball loss for soluble sugars.
    simulated_w = model.results.w.values
    observed_w = observed_data.w.values
    simulated_mSol = model.results.mSol.values
    observed_mSol = observed_data.mSol.values
    w_metric = MeanSquaredError()
    mSol_metric = MeanPinballLoss()
    w_discrepancy = w_metric.calculate(observed_w, simulated_w)
    mSol_discrepancy = mSol_metric.calculate(observed_mSol, simulated_mSol)
    return w_discrepancy, mSol_discrepancy
Let's run the multi-objective optimisation procedure, again using the TPES algorithm from the Optuna library.
specification = OptimisationMethodModel(
    experiment_name="optuna_optimisation",
    parameter_spec=parameter_spec,
    observed_data=observed_data,
    method="tpes",
    output_labels=["Water Mass Discrepancy", "Carbon Content Discrepancy"],
    directions=["minimize", "minimize"],
    n_jobs=1,
    n_iterations=25,
    method_kwargs=dict(n_startup_trials=10),
)
calibrator = OptimisationMethod(
    calibration_func=objective, specification=specification, engine="optuna"
)
calibrator.specify().execute().analyze()
<calisim.optimisation.implementation.OptimisationMethod at 0x7faf30acae60>
We can see some useful visualisations for both of our objectives, but we'll focus on the Pareto-front plot.
For single-objective optimisation, you end up with a single best solution and a single optimal parameter set. The aim of multi-objective optimisation is instead to identify the Pareto optimal set across your different objectives. Formally, a solution is Pareto optimal if no other solution can improve one objective without worsening at least one other; informally, the Pareto optimal set contains all of the best available trade-offs between the objectives.
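This definition can be sketched directly: given a set of trial results for two minimisation objectives (the values below are hypothetical, not from the study), the Pareto optimal set is the subset of non-dominated points.

```python
import numpy as np

# Hypothetical (objective 1, objective 2) results, both to be minimised.
points = np.array([
    [1.0, 9.0],
    [2.0, 7.0],
    [3.0, 8.0],  # dominated by (2, 7): worse on both objectives
    [4.0, 2.0],
    [5.0, 1.0],
])

def is_pareto_optimal(points: np.ndarray) -> np.ndarray:
    flags = np.ones(len(points), dtype=bool)
    for i in range(len(points)):
        for j in range(len(points)):
            # j dominates i if it is no worse on every objective
            # and strictly better on at least one.
            if i != j and np.all(points[j] <= points[i]) and np.any(points[j] < points[i]):
                flags[i] = False
                break
    return flags

pareto_front = points[is_pareto_optimal(points)]
```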
In the Pareto-front plot, we see the trade-off between the Water Mass discrepancy and Carbon Content discrepancy objectives. We could then select one parameter set from our Pareto optimal set depending on whether we wish to favour Water Mass discrepancy or Carbon Content discrepancy.